Designing toxicogenomics studies that use DNA array technology.

BACKGROUND
Bioassays are routinely used to evaluate the toxicity of test agents. Experimental designs for bioassays are largely encompassed by fixed effects linear models. In toxicogenomics studies where DNA arrays measure mRNA levels, the tissue samples are typically generated in a bioassay. These measurements introduce additional sources of variation, which must be properly managed to obtain valid tests of treatment effects.


RESULTS
An analysis of covariance model is developed which combines a fixed-effects linear model for the bioassay with important variance components associated with DNA array measurements. These models can accommodate the dominant characteristics of measurements from DNA arrays, and they account for technical variation associated with normalization, spots, dyes, and batches as well as the biological variation associated with the bioassay. An example illustrates how the model is used to identify valid designs and to compare competing designs.


CONCLUSIONS
Many toxicogenomics studies are bioassays which measure gene expression using DNA arrays. These studies can be designed and analyzed using standard methods with a few modifications to account for characteristics of array measurements, such as multiple endpoints and normalization. As much as possible, technical variation associated with probes, dyes, and batches are managed by blocking treatments within these sources of variation. An example shows how some practical constraints can be accommodated by this modelling and how it allows one to objectively compare competing designs.


Background
Many toxicogenomics studies are intended to evaluate the response of the transcriptome to test agents. Observed changes in mRNA concentrations in tissues can indicate the mechanism of toxicity, at least in concept (Chin and Kong, 2002;Guerreiro et al. 2003;Hamadeh et al. 2002;Irwin et al. 2004;Lord et al. 2006;Luhe et al. 2005;Nuwaysir et al. 1999;Waring and Halbert, 2002;Waters et al. 2003). However, the analysis and interpretation of these studies confront many problems that are often compounded by poor designs (Lay et al. 2006).
The response of an organism to applied treatments or selected characteristics is evaluated in a bioassay. The bioassay, that underlies most toxicogenomics studies conducted at the National Center for Toxicological Research, is to treat animals of specifi ed phenotypes (e.g. species, sex, strain) or conditions (e.g. ages, calorie restricted) with various levels of test agents (e.g. doses, exposure regimes). The study designs are typical of bioassays routinely used in toxicology. In fact, tissues are often collected from bioassays primarily designed for more traditional purposes. Many design issues, such as sample size or power, can largely be addressed through standard procedures for multi-factor designs. The goal in this paper is to describe a heuristic model for the statistical analysis that integrates the usual approach to designing bioassays with the added complications associated with measuring expression with DNA arrays.
DNA array technologies simultaneously measure transcript levels from thousands of genes per sample, and this ability to interrogate transcript levels for an entire genome motivates the interest in toxicogenomics studies. However, these measurements are generated through chemical assays comprised of several steps, which are prone to technical variation (Yang and Speed, 2002). The study design must incorporate these technical sources of variation with the biological variation in the bioassay in a way that does not compromise treatment comparisons.
The data generated in these studies are analyzed 'by gene', and the interpretation of the signifi cance of p-values associated with tests for treatment effects must cope with misclassification rates expected with such a large number of tests. In this setting, it is desirable to declare signifi cance based on false discovery rates (Allison et al. 2006;Benjamini and Hochberg, 1995;). The traditional power/sample size calculation for a single test remains useful to evaluate the relative merits of competing designs, but benefi ts from modifi cations, which arise by consideration of multiple testing.
It is common practice to transform, normalize and background correct the intensities from DNA arrays although there is no widespread agreement on the best procedures (Allison et al. 2006). Each procedure impacts the analysis, and this should be addressed in the design. While we have no special insight as to the best procedures, the specific procedures, which we use, are justifi ed in the context of an overall analysis strategy. For actual examples of analyses, see (Delongchamp et al. 2003;Delongchamp et al. 2005;Desai et al. 2007;Desai et al. 2004).
Hypotheses are tested using an analysis of covariance model to account for the variance removed by normalizing the measurements and to adjust for specifi ed sources of variation (Parrish and Delongchamp, 2006). These designs often use a standard reference sample (Desai et al. 2007;Desai et al. 2004). While a 'reference' design can be less effi cient than balanced incomplete block designs (Kerr and Churchill, 2001;Vinciotti et al. 2005), they offer versatility with respect to treatment layouts and sample sizes. A recent set of papers discusses some of these issues (Dobbin et al. 2003b;Dobbin et al, 2005;Martin-Magniette et al. 2005a;Martin-Magniette et al. 2005b). From our perspective, the goal is to design studies which reliably test treatment effects with adequate power, and the designs we use meet this goal. Designs, which estimate expression changes relative to a standard, facilitate across-study comparisons. Since functional genomic studies tend to be exploratory, follow-up studies should be anticipated and estimating effects relative to a standard has merit in this context.

Methods
At the design stage, a statistical model is a framework through which experimental design questions can be addressed. Our goal is to design a toxicogenomics study that allows valid treatment comparisons when confronted by several sources of variation. Given a design, the model allows one to incorporate the recognized sources of variation into expected mean squared errors for a statistical analysis. Before a design is implemented, its expected mean squares should demonstrate that an analysis with valid tests is possible. Usually several designs can give valid tests, and the model helps to choose among them. An example illustrates the design process.

Model for bioassay
It will be taken for granted that tissue samples and treatments have been suitably selected to address hypotheses of interest. For didactic purposes, we presume that the treatment layout of the bioassay is a fi xed effects linear model as this level of generalization accommodates the majority of toxicogenomics studies conducted at NCTR. Under a fi xed effects linear model, the 'expression' data for an arbitrary gene has an expectation and variance that can be written as At this juncture, y i denotes an observed measurement of expression, suitably transformed, for experimental unit, i, obtained at treatment levels specifi ed through a design matrix, X i , and σ 2 denotes the within-treatment variance among experimental units. Equation (1.1) models an arbitrary gene. To "scale up" to the thousands of genes which are measured per experimental unit, the parameters, β and σ 2 , are assumed to be gene specifi c. The multivariate structure is ignored in practice, and statistical analyses are applied one gene at a time Wolfi nger et al. 2001). Consequently, Equation (1.1) is adequate for this discussion. From the computation perspective, fi xed effects linear models are widely implemented in well documented statistical packages. Our experience is with SAS, and these models are easily implemented using PROC GLM (SAS, 1999). Further, this basic design structure can accommodate more sophisticated analyses, which incorporate current trends in statistical methods for DNA array data (Allison et al. 2006). It is not too diffi cult to implement shrinkage/empirical Bayes adjustments to the variance (Smyth, 2004;Wright and Simon, 2003), which will increase statistical power when sample sizes are small. We have also adapted these analyses to correlated groups of genes (e.g. gene ontology groupings) .
Without further elaboration, we presume that the set of biological samples, {X i : i = 1, …, n}, defi nes a reasonably effi cient design with respect to the hypotheses of interest, and note that the properties of experimental designs have been well studied in this setting. In our work, 'n' is arrived at through a conventional sample size/power assessment, which addresses the issue of multiple testing by reducing the signifi cance level, e.g. power is computed at signifi cance level of 0.001 instead of 0.05 for a representative value of σ 2 . This is elaborated through the example in Section 5.

Model for DNA arrays
For our purposes, a DNA array is a surface that is spotted with DNA. The spots are arranged in a grid at a regular spacing, the array. Each spot is DNA with a known nucleotide sequence that corresponds to part of a gene's DNA sequence. Hereafter, the spotted sequence will be referred to as a probe. When a mixture of DNA species is hybridized to the arrayed probes, complementary sequences bind preferentially thereby sorting the mixture into its constituents. To measure a sample, which is a mixture of mRNA species, an aliquot of the total mRNA is transcribed into complementary DNA. This cDNA is labelled with a fl orescent dye or a radioactive isotope. The labelled-cDNA is hybridized to the probes, and the intensity of label bound to each probe is measured. More than one sample can be hybridized to an array by using different labels.
There are several commercially available platforms that measure gene expression by hybridizing a labelled sample to a set of DNA probes, and their details can differ from the preceding description. Our experience has been with spotted arrays, both 'macro' arrays, e.g. www.clontech.com (Bowyer et al. 2007;Delongchamp et al. 2003;Delongchamp et al. 2005) and oligonucleotides spotted on glassslide microarrays, such as 'Mitochip' (Desai et al. 2007). Because of our experience, we describe the statistical model in the context of a spotted oligonucleotide technology, specifi cally 'Mitochip'. In large part, the statistical issues do not depend on how a DNA probe is created. That is, in situ synthesized microarrays, such as arrays manufactured by Affymetrics: www.affymetrix.com, record a label-intensity at each 'feature', which has essentially the same statistical properties as the spotted technologies. Affymetrics arrays measure multiple probes for each gene, both sequence matches and mismatches. To obtain a single expression measurement, it is necessary to collate intensities from many probes. Likewise, it is not necessary that the DNA probes are physically bound to a surface. For example, the Illumina bead platform, www.illumina.com, still obtains label-intensities associated with a set of DNA probes even though the beads are not arrayed. While PCR methods are technologically distinct, the modeling employed here extends quite easily to that technology (Desai et al. 2007).
Let P denote the set of probes on an array. Label intensities are observed at the location of each probe. These intensities need to be related to the mRNA concentrations in the original tissue. In general, a tissue sample can be assayed more than once. To accommodate this, we add a subscript to designate the aliquot measured on an array. The intensity observed at a probe (g) for an aliquot/ array (a) from experimental unit (i) will be designated I ga(i) .
In evaluating a gene's function, we attempt to observe how its mRNA concentration responds to manipulated conditions. The concentration of mRNA for gene (g) in a sample of tissue (i) will be denoted as C gi . It is usually assumed that the intensity is proportional to the mRNA concentration in the sample, I ga(i) ∝ C gi . In general, this assumption is not correct unless background contributions to the intensity can be ignored and probe amounts are in excess of amounts in the aliquot hybridized to the array (Held et al. 2003;Held et al. 2006;Zhang et al. 2007). These issues are limitations of the technology that are not directly addressed by the design considerations discussed here.
For two samples, no difference in the expression of a gene is taken to mean that the mRNA concentrations are equal, that is C gi = C g j . The total concentration of mRNA in a sample is the sum of concentrations from the constituent mRNAs (genes). Since C gi = 0 when a mRNA species is absent, the total concentration can be written as the sum over genes that comprise the genome. Let G denote this set, and write this sum as ( 1.2) Then the proportion of the total mRNA that arises from gene (g) is An assay of gene expression begins with a fi xed amount (w μg) of the total mRNA, which was recovered from a tissue sample, so that π gi w μg of mRNA from gene 'g' is in the starting aliquot. Note that equal amounts of a specifi c mRNA in two starting aliquots do not imply that the initial samples have equal expression since π gi w = π g j w does not imply that C gi = C g j . For equal amounts in aliquots to imply equal concentrations in samples, the total mRNA must be the same for both samples, that is This usually would not be the case if some genes in these samples have differential expression. In order to 'normalize' the amounts in aliquots so that they refl ect the concentrations in tissues, one needs to know or estimate the ratio, The essential chemistry is the specifi city of DNA hybridizations. For every probe on the array, g ∈P, and gene in the genome, h ∈G, there is a binding affi nity, κ gh . If the gene-specifi c amounts of labelled cDNA are proportional to the corresponding mRNA in the aliquot, the intensity will be proportional to the relative concentration of gene 'g', κ gg π gi , plus a background, κ π gh hi h h g , , ∈ ≠ ∑ G which arises from binding with DNA from other genes in the sample. That is, (1.4) Successful sorting implies that the binding to probe 'g' is much larger for labelled cDNA from gene 'g' than for labelled cDNA from any other gene, i.e. κ gg >> κ gh when h ≠ g. Since the background should be small, it is of little consequence when the interrogated gene is highly expressed, in which case: I ga(i) ∝ π gi . Background is problematic when π gi is also small. At least for this discussion, this problem with background is viewed as a limitation of the technology that biases estimated effects.
As an operational model, we ignore background and assume that the observed intensity is reasonably approximated as The observed log-intensity depends upon mRNA concentration in the sample, log 2 Cgi , as well as a proportionality term associated with processing the aliquot, K ga , and a 'normalizing' term, log 2 C •i .

Technical variation: spot, dye and batch effects
Whenever two samples are hybridized to a single array, they share some sources of variation. In particular, hybridization conditions, DNA binding sites, and spot alignments will be identical. Common conditions make intensities for these samples more alike than those from samples that have been measured on other arrays. That is, samples measured on different arrays exhibit more technical variation, which we refer to as a spot effect. In order to hybridize two samples on an array, they must be labelled with different dyes. There also is a dye effect. Spot and dye effects can be directly estimated by a dye-fl ip between two samples ).
Consider two samples, i and j, which are measured with a dye flip on two arrays. Table 1 illustrates the layout, where we let y = log 2 I to streamline the notation. By measuring each sample twice, the array/spot effect and the dye effect are estimable and they can be eliminated from the estimate of the treatment effect. In this context, the dye effect is defi ned to be the difference between the average of intensities for the respective dyes, i.e.
At least heuristically these equations can be related to Equation 1.5 as follows: That is, the proportionality constant in Equation 1.5 depends on spot and dye effects which can be estimated and eliminated via this dye fl ip. In particular, t = log 2 π i − log 2 π j = (y 11 + y 22 )/2 − (y 12 + y 21 )/2.
A dye-fl ip works well when there are two treatments since the treatments can be blocked within spots and dyes (Delongchamp et al. 2005). However, it becomes increasingly diffi cult to manage technical sources as the number of treatments is increased. A straightforward solution is to do the dye fl ip with a standard sample, e.g. (Desai et al. 2004), where the standard is characterized as a sample which has a constant expression level throughout the study.
A dye-fl ip with a standard uses two arrays per sample and doubles the cost and work of assaying each sample. However, each sample is measured twice, thereby reducing the technical error. This can be a reasonable way to increase the precision of estimated treatment effects whenever tissue samples are at a premium. At least conceptually, this type of dye fl ip eliminates spot effects and dye effects yielding adjusted estimates of expression in the samples relative to expression in the standard. In earlier work, there was an interest in mining databases created from several studies, and estimates of expression relative to a common standard seemed desirable. In practice, the sequences used for probes, assay conditions, etc., change from study to study and these changes undermine the utility of measuring expression relative to a standard.
In current work, we usually label the sample with CY5 and a universal reference with CY3. The difference between the log-intensities of the sample and reference eliminates the spot effect but not the dye effect, y 11 − y 21 = t + d. The dye effect usually differs for each probe. Within a probe, this effect appears to be reasonably stable across samples, and in recent studies we presume it is a constant bias in the difference between the sample and the standard. Because we block on batches, a dye 'bias' only needs to be constant within batches to manage this effect.
In addition to variation associated with spots and dyes, we explicitly design studies to manage technical sources of variation that can be associated with concurrent processing conditions (batches). Samples assayed concurrently share processing conditions making their measurements more homogeneous than measurements made at other times. Some of these sources of variation may not be trivial resulting in a 'batch' effect. In general, all of the samples in a study cannot be processed concurrently. So, the batch effect should be incorporated into the experimental design.

Combined model
Let y gi be a difference between a sample and the standard, This gives To eliminate log 2 (C •i /C •s ) from the statistical model, we use a normalizing covariate, z i . Many normalization methods can be implemented through an analysis of covariance (Parrish and Delongchamp, 2006). In the example, we normalize using housekeeping genes and the following details are for that case. Defi ne housekeeping genes, g ∈H, as a set of genes that satisfy C gi = C gj for all samples, i.e. genes which are not affected by the treatments. Let the normalizing covariate be the average of log-intensities for the housekeeping genes from a sample, i.e. z h y log ( / ) and note that this quantity is the same for all samples, i = 1,…,n, because of the a priori assumption that these are housekeeping genes. Equation (1.6) implies By combining Equations (1.6) and (1.7), the expectation of an observed intensity is This relationship is an analysis of covariance model, E log , where the location shift (bias) implied by [ ] log 2 H is absorbed into the overall mean of the linear model, X i β g , and the parameter, γ g , captures the attenuation associated with replacing the expectation, E(z i ), with an estimate, z i . This statistical model is an analysis of covariance model with two generic sources of variation, biological and technical. Biological variation should be estimated through replicate samples within treatments. Technical variation arises from sources connected with the sample-processing steps in the measurement of gene expression. These sources are associated with spots, dyes and batchs.

Example
The example discusses the design of a toxicogenomics study to evaluate treatment effects on mRNA transcripts involved in mitochondrial function. It illustrates a situation where tissues are collected from a bioassay designed primarily for other purposes. In this case, a bioassay was conducted at NCTR for the National Toxicology Program, which is assessing the carcinogenic potential of AZT when it is administered to prevent mother to child transmission of AIDS. Muscle tissue was collected from 120 mice, 6 mice per treatment from 20 treatments: 2 sexes × 2 sacrifi ce times × 5 levels of AZT exposure. AZT is known to affect the mitochondria and the microarray assay is intended to complement an overall toxicity assessment by testing for treatment effects on transcript levels of 542 genes important to the structure and/ or function of mitochondria (Desai et al. 2007).
The 20 'treatments' in this bioassay are arranged in a factorial design of 5 AZT dosing regimes in both sexes with tissues collected at either a one or two hour interval after the fi nal AZT exposure. This arrangement of treatments can be parameterized in several equivalent ways. Table 2 gives one parameterization, which serves as the basis of discussions to follow. Table 2 is a standard partitioning of the sums of squares into sources of variation, and it corresponds to a fi xed effects linear model, Equation (1.1). The statistical signifi cance of each source would be evaluated through an analysis of variance. In the context of measuring gene expression, this defi nes the statistical analysis of the data for an arbitrary gene if its mRNA concentrations in each mouse were measured with a simple additive error, i.e. Equation (1.1).
In the design we specifi cally want to account for biological variation associated with mice ( σ m 2 ) and technical variation associated with batches (σ b 2 ), as well as spots and dyes. Effects of spots and dyes are eliminated by analyzing the difference between the log-intensities on the red and green channels. Variation (e.g. amount and uniformity of the spotting) associated with the probe is 'removed' by subtracting the CY5 log-intensity from the CY3 log-intensity. The sample is labelled with CY5 and a universal reference is labelled with CY3, which effectively removes dye effects from treatment contrasts (Dobbin et al. 2003a;Dobbin et al. 2003b).
In addition, each probe is spotted twice on the array providing a sub-sampling error. Conceptually, this source of variation is a lower bound on the technical variation which can be partitioned from the residual technical variation (σ 2 ). This error can be quite useful for diagnostic purposes, and it is routinely evaluated at the analysis stage (Delongchamp et al. 2005). Because this level of detail does not alter the main design issues, this component was absorbed into the residual in the expected mean squares that are presented here. That is, the tabled expectations assume that the log-intensity differences for the two spots have been averaged.
The design has to accommodate two logistic constraints. First, the amount of mRNA that can be extracted reliably from each sample is anticipated to be near the minimum required for a microarray assay. We formally considered 2 options, 1) using 4 mice (assumes at least 4 among the 6 mice will have enough mRNA) or 2) pooling the mRNA from 2 mice so that the 6 mice form 3 mRNA samples. The second constraint is that the microarray laboratory can concurrently process 6 to 8 tissue samples on DNA arrays. Consequently, arrays for 20 treatments cannot be processed as a batch. Since the primary interest is in AZT effects, we chose to process 5 samples as a batch, one sample from each AZT treatment within a sacrifi ce time and sex. Table 3 outlines the degrees of freedom and expected mean squares under the first option. Table 4 outlines the degrees of freedom and expected mean squares under the second option, where it is assumed that pooling 2 mice reduces the biological variance to σ m 2 2 / . Note that the fi rst option uses 80 arrays while the second option uses 60 arrays. Also the fi rst option is more 'iffy' since one cannot guarantee that 4 of the samples will have enough mRNA for a microarray assay. Consequently, option 2 is preferred unless the statistical power under option 1 is substantially better.
It is not entirely clear how to evaluate the power of competing designs because of the large number of comparisons (Tsai et al. 2005). In this setting, the signifi cance of a p-value is diffi cult to interpret. In our analyses, the genes are ordered by their p-value, and a gene with a smaller p-value is presumed to be more signifi cant . Signifi cant genes are identifi ed by a post hoc evaluation of the false discovery rate associated with any partition of the ordered genes into an 'affected' set and the remainder, presumed not to be affected. We implement this analysis strategy at the design stage by decreasing the signifi cance level used in a conventional power computation. In the power calculation, we want to detect most of the genes that have suffi ciently large treatmentinduced changes, say greater than 1.5 fold. That is, if a gene has a 1.5-fold treatment-induced change, the statistical test should have a reasonably small p-value, say p ≈ 1/(# genes on the array), Table 3. Expected mean squared errors for option 1 of the combined model. Table shows the expected mean squared errors associated with a combined model. This layout corresponds to a 2 × 2 × 5 factorial arrangement of sex, time, AZT and their interactions with the 80 muscle samples processed in 16 batches (blocks); each batch has 5 samples, one sample for each AZT level within a sex and a sacrifi ce time.

Source
DF Expected mean square (approx.) Mean: Housekeeping genes 1  Tables 3 or 4, which are gene-specific and unknown (Tsai et al. 2005). Because so many important parameters are unknown, a comprehensive assessment of power is not attempted. For a ball park assessment of competing designs, the power of a two-sided t-test for an interesting contrast can be computed using the median values of variance components from a previous study. Typically, one must choose somewhat arbitrarily among available studies attempting to most closely match the array, tissue and mouse strain of the current study. In this case, data from an AZT study using the same array on liver samples from the same mouse strain were available (Desai et al. 2007). This study gave the following medians for variance components of the 542 mitochondrial genes on the array: Under the factorial design (Tables 3 or 4), there are a number of hypotheses and sub-hypotheses that can be tested. We evaluate the overall performance of a design by specifying the power for a few contrasts, which we deem to be most important.
Factorial designs have the most power to detect the main effects (in this case: 'AZT', 'Sex', and 'Time'), but contrasts among levels of the main effects are not the most interesting when interactions are anticipated. In toxicological studies, the effect of a dosing regime is frequently believed to depend on the levels of other factors, which are included in the study to broaden the scope of the risk assessment. This is the case here where the mitochondrial toxicogenomics associated with AZT treatment may also depend on sex or sacrifi ce time (sacrifi ce times were based upon the pharmacokinetics of AZT metabolism in these mice). When interactions are signifi cant, interpretations of the results depend on contrasts within levels of other factors. With this appraisal in mind, we evaluated the power for two 'key' contrasts, an AZT contrast within a sex and sacrifi ce time and a sex contrast within a sacrifi ce time. These contrasts use different error terms (Tables 3 or 4); the AZT contrast uses 'Mice within Treatments' and the sex contrast uses 'Batch'. Figure 1 plots the power as a function of fold-change for these two contrasts under both design options. All four curves have adequate power to pick up 1.5 fold-changes at the 0.002 level of signifi cance. Because both options have adequate power, option 2 was adopted as this design has better cost/logistic properties. When substantial batch effects are present but ignored in the design and subsequent analyses, the computed F-ratios are unlikely to be valid tests of the nominal hypotheses. In general, the specifi c consequence of ignoring substantial batch effects would not be very tractable since they are unmanaged in the design. However for the design in Table 4. Expected mean squared errors for option 2 of the combined model. Table shows the expected mean squared errors associated with a combined model. This layout corresponds to a 2 × 2 × 5 factorial arrangement of sex, time, AZT and their interactions with the 60 pooled muscle samples (2 samples per pool) processed in 12 batches (blocks); each batch has 5 pools, one pool for each AZT level within a sex and a sacrifi ce time.

Source DF Expected mean square (approx.)
Mean: Housekeeping genes 1 Table 4, the batch effects are nested within sex and time effects and the result of disregarding batch effects can easily be evaluated. Under this design, proper analyses partition the variation as presented in Table 4. If an analyst were to disregard the 'batches' component such that this variation is simply added to the 'mice within treatments' component, then the pooled error term would have an expected mean squared error of approximately , which would be estimated with 39 degrees of freedom. This 'pooled' estimate would misrepresent the appropriate error variances by over estimating the error variance of AZT treatments and under estimating the error variance of sex or time effects. The F-ratios would not be valid tests of the nominal hypotheses and would result in fewer AZT effects, more sex effects and more time effects than appropriate.
Several other designs are valid. They were not formally considered because they represent different priorities as to what effects are important. For example, an alternative is to process the arrays separately as 'batches' of one, in that sense ignoring batch. This design provides valid tests for the interesting factors by incorporating the variation associated with batches into the residual (Table 5). Compared to the design of Table 4, this design provides more precise estimates of the sex and time effects, but less precise estimates of AZT effects. Our preference for the design in Table 4 refl ects our desire to estimate AZT effects, and an a priori belief that sex and time effects are less important. In addition, the design in Table 5 does not utilize the lab effi ciently making this design more expensive. The lab is capable of processing six to eight arrays as a batch and this design processes one.

Discussion
When DNA arrays are used to measure gene expression, several aspects of the technology should play a part in the experiment's design. The observed intensities must be normalized for inferences to properly refl ect treatment differences. Since normalization removes a major part of the variation, implementing the adjustment as a covariate gives a direct assessment of its impact. The assay of samples involves several steps and arrays processed concurrently produce measurements that are more homogeneous. Whenever samples are assayed in batches, the design should manage this effect. Likewise, when multiple (usually two) samples are hybridized to an array the measurements will be more homogeneous. Consequently, Figure 1. Power versus fold change for two key contrasts. The fi gure plots the power for a two-sided t-test at the 0.002 signifi cance level as a function of the fold change. Blue lines give the power for a sex difference within a sacrifi ce time, and red lines give the power for a difference between two AZT levels within a sex and sacrifi ce time. Solid lines use the error for Option 1 (Table 3) and the dashed lines use the error for Option 2 (Table 4). designs need to address variation from batches, dyes and arrays as well as the variation among samples (experimental units).
We view toxicogenomics studies as assays of mRNA in tissue samples nested within a bioassay of treatments. The purpose of the study is to estimate treatment differences in gene expression. Statistical tests of treatment effects require a valid assessment of within-treatment variation, which will generally be the variation among tissue samples from treated animals. As the example illustrates, tissue samples can be pooled but the design needs to estimate the variation among experimental units (pools in this example). The technical sources of variation associated with each mRNA assay (array) primarily need to be managed so that variance among experimental units can be estimated. We accomplish this by blocking on samples that are processed concurrently and samples measured on the same array.
As the example illustrates, samples often come from studies which were originally designed for reasons other than studying gene expression. In such cases, the design is constrained by the available samples. The use of standard samples adjusts simply for probe-specifi c effects and designs having valid tests, which adjust for batch, array, and dye, can be found for virtually any sample size.
Although background expression can be ignored at the design stage, some kind of strategy should address background at the analysis stage. A common strategy is to subtract an estimate of the background from the observed intensity. This strategy is not recommended for a couple of reasons. First, the usual estimates of background available to the analyst simply do not measure the appropriate background. Second, subtraction generates negative intensities which cannot be log-transformed and implemented in the planned analysis. In essence, our strategy is to only pursue treatment effects among those probes for which background expression is judged to be negligible (Delongchamp et al. 2005). Actually, all probes are analyzed but those with 'background' levels of expression for all samples are classifi ed as 'not expressed'. The results of analyses on these genes are separated from those for the 'expressed' genes. The 'not expressed' genes can and often do indicate problems with the statistical model . Genes that are not expressed cannot be differentially expressed. So, the empirical distribution of p-values from tests of treatment effects should follow the null distribution. When this empirical distribution is not consistent with a uniform distribution, one of several potential problems is an inappropriate design.
Normalization is necessary and a diffi cult problem in practice. We have avoided the issues by assuming the existence of a normalizing covariate. This is a reasonable approach when developing an experiment's design, but at the time of analysis one needs to produce a covariate. There are a number of methods which have been proposed for normalization and many can be implemented through an analysis of covariance (Parrish and Delongchamp, 2006). Normalization methods potentially mitigate effects more properly considered as batch effects or array effects in addition to addressing the adjustment discussed in this paper. We caution against relying on normalization adjustments in lieu of blocking because they can introduce large biases (Delongchamp and Kodell, 2004;Parrish and Delongchamp, 2006). Table 5. Expected mean squared errors for option 2 of the combined model with one array per batch. Table shows the expected mean squared errors associated with a combined model. This layout corresponds to a 2 × 2 × 5 factorial arrangement of sex, time, AZT and their interactions with the 60 pooled muscle samples (2 samples per pool) processed in 60 batches (blocks); each batch has 1 pool.

Conclusions
Many toxicogenomics studies are bioassays which measure gene expression using DNA arrays. These studies can be designed and analyzed using standard methods with a few modifi cations to account for characteristics of array measurements, such as multiple endpoints and normalization. As much as possible, technical variation associated with probes, dyes, and batches are managed by blocking treatments within these sources of variation. An example shows how some practical constraints can be accommodated by this modelling and how it allows one to objectively compare competing designs.

Authors' contributions
Robert R. Delongchamp and Cruz Velasco developed the experimental designs in consultation with all authors. Robert R. Delongchamp wrote the manuscript. Cruz Velasco reviewed literatures. Varsha G. Desai, Taewon Lee and James C. Fuscoe provided the example. All authors read and approved the fi nal manuscript.